Segmented Topic Model for Text Classification and Speech Recognition
Abstract
This paper presents a segmented topic model (STM) that explores topic regularities while simultaneously partitioning text or spoken documents into coherent segments. A topic model based on latent Dirichlet allocation (LDA) extracts the topics and is strengthened by a Markov chain that detects the segments in a document. STM is trained by a variational Bayesian procedure in which a Viterbi decoder carries out the document segmentation. Each segment is represented by a Markov state, so nonstationary stylistic and contextual information is captured and the word variations within a document are compensated. In the experiments, STM outperformed LDA for text classification on the ICASSP dataset and for speech recognition on the WSJ corpus.

1 Introduction

Latent Dirichlet allocation (LDA) [4] was proposed to generalize to new documents and has been adopted for document representation in text classification and summarization systems [5] as well as for language model adaptation in speech recognition systems [6]. LDA represents documents with a bag-of-words scheme and ignores word positions. However, word usage varies across the segments of a document even when those segments involve the same topic. Such variations degrade the accuracy of document representation and word prediction. In this study, a Markov chain is merged into LDA to detect stylistically similar segments and to estimate the time-varying word statistics of a document. Each segment indicates a specific writing or speaking style in the composition of a text or spoken document. The resulting segmented topic model (STM) exploits both the topic information across documents and the word variations within a document. For STM parameter inference, a Viterbi variational inference algorithm is presented, which runs a Viterbi decoding stage within a variational Bayesian EM (VB-EM) procedure. The proposed STM is evaluated on text classification and speech recognition.
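To make the role of the Viterbi decoding stage concrete, the following is a minimal sketch of a generic Viterbi decoder that assigns one Markov state to each word position and then groups the path into segments. It is not the authors' Viterbi variational inference algorithm: the per-word scores `log_emit` are assumed to be given (in STM they would come from the variational step), and the function names `viterbi` and `segments` are hypothetical.

```python
import math


def viterbi(log_pi, log_A, log_emit):
    """Most likely state path of a first-order Markov chain.

    log_pi[j]      : log initial probability of state j
    log_A[i][j]    : log transition probability from state i to state j
    log_emit[n][j] : log score of the n-th word under state j (assumed given)
    """
    N, S = len(log_emit), len(log_pi)
    delta = [log_pi[j] + log_emit[0][j] for j in range(S)]
    backptr = []
    for n in range(1, N):
        prev, delta, ptr = delta, [], []
        for j in range(S):
            # Best predecessor of state j at position n.
            best = max(range(S), key=lambda i: prev[i] + log_A[i][j])
            ptr.append(best)
            delta.append(prev[best] + log_A[best][j] + log_emit[n][j])
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(S), key=lambda j: delta[j])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def segments(path):
    """Group a state path into (start, end, state) runs; end is exclusive."""
    runs, start = [], 0
    for n in range(1, len(path) + 1):
        if n == len(path) or path[n] != path[start]:
            runs.append((start, n, path[start]))
            start = n
    return runs


# Toy example: two states with "sticky" transitions; the first three words
# favour state 0 and the last three favour state 1.
log = math.log
log_pi = [log(0.5), log(0.5)]
log_A = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]
log_emit = [[log(0.8), log(0.2)]] * 3 + [[log(0.2), log(0.8)]] * 3

path = viterbi(log_pi, log_A, log_emit)
print(path)            # [0, 0, 0, 1, 1, 1]
print(segments(path))  # [(0, 3, 0), (3, 6, 1)]
```

Working in the log domain avoids numerical underflow on long documents, which is the usual reason for preferring sums of log scores over products of probabilities in such decoders.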
2 Segmented topic model

The word distributions in different paragraphs of a text or spoken document vary with the composition style and document structure. In addition to topic information, the temporal positions of words carry information in natural language, e.g. in scientific articles and broadcast news documents. In particular, even when a scientific article addresses a specific set of research topics, the word distributions differ among the abstract, introduction and experiment segments. To compensate for these temporal variations, a Markov chain is merged into the model to characterize the dynamics of words in different segments, as displayed by the graphical representation of STM in Figure 1. The $K$-dimensional topic mixture vector $\theta$ is drawn from a Dirichlet distribution with parameter $\alpha$. The topic sequence $\mathbf{z}$ is generated by a multinomial distribution with parameter $\theta$. The state sequence $\mathbf{s} = \{s_1, \ldots, s_N\}$ is generated by a Markov chain with an initial state parameter $\pi$ and an $S \times S$ state transition probability matrix $\mathbf{A} = \{a_{s_{n-1} s_n}\}$. Each word is associated with a topic and a segment. The marginal likelihood of a document over the unseen variables $\theta$, $\mathbf{z}$ and $\mathbf{s}$ is given by

$$p(\mathbf{w} \mid \alpha, \mathbf{B}, \pi, \mathbf{A}) = \int p(\theta \mid \alpha) \sum_{\mathbf{s}} \sum_{\mathbf{z}} \prod_{n=1}^{N} p(w_n \mid z_n, s_n, \mathbf{B})\, p(z_n \mid \theta)\, p(s_n \mid s_{n-1}, \pi, \mathbf{A})\, d\theta$$
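The generative process described in this section can be sketched as a short sampler. This is an illustration under stated assumptions, not the authors' implementation: the word parameter `B` is assumed here to be indexed as `B[s][z]`, a distribution over the vocabulary for state `s` and topic `z`, and the function name `sample_document` and the toy parameter values are hypothetical.

```python
import random


def sample_document(alpha, pi, A, B, N, seed=0):
    """Draw N words from a segmented topic model.

    alpha : Dirichlet parameter, length K
    pi    : initial state probabilities, length S
    A     : S x S state transition matrix, A[i][j] = p(s_n = j | s_{n-1} = i)
    B     : assumed layout B[s][z] = word distribution for state s, topic z
    """
    rng = random.Random(seed)
    K, S = len(alpha), len(pi)

    # theta ~ Dirichlet(alpha), drawn as normalized Gamma variates.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    theta = [g / total for g in gammas]

    words, topics, states = [], [], []
    s = None
    for _ in range(N):
        # Topic z_n ~ Multinomial(theta).
        z = rng.choices(range(K), weights=theta)[0]
        # State s_n follows the Markov chain: pi for the first word, A after.
        s = rng.choices(range(S), weights=pi if s is None else A[s])[0]
        # Word w_n depends on both the topic and the segment state.
        w = rng.choices(range(len(B[s][z])), weights=B[s][z])[0]
        words.append(w)
        topics.append(z)
        states.append(s)
    return words, topics, states, theta


# Toy setting (assumed values): 3 topics, 2 states, 4 vocabulary items.
# The left-to-right transition matrix is an illustrative choice so that
# states form contiguous segments; the text only requires a Markov chain.
alpha = [0.5, 0.5, 0.5]
pi = [1.0, 0.0]
A = [[0.9, 0.1], [0.0, 1.0]]
B = [[[0.7, 0.1, 0.1, 0.1]] * 3, [[0.1, 0.1, 0.1, 0.7]] * 3]

words, topics, states, theta = sample_document(alpha, pi, A, B, N=50, seed=1)
print(states)
```

Because the state chain is sampled independently of the topic draws, words with the same topic can still follow different distributions in different segments, which is the within-document variation the model is meant to capture.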